Automatic Topic Identification for Large Scale Language Modeling Data Filtering
Abstract
The paper presents a topic identification module that is embedded in a complex system for acquiring and storing large volumes of text data from the Web. The module processes each acquired data item and assigns it keywords from a defined topic hierarchy, which was developed for this purpose and is also described in the paper. The quality of the topic identification is evaluated in two ways: directly, using classic precision-recall measures, and indirectly, by measuring the ASR performance of topic-specific language models built from the automatically filtered data.
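The direct evaluation mentioned in the abstract can be illustrated with a minimal sketch. It assumes each data item carries a set of reference topics and a set of automatically assigned topics, and it computes micro-averaged precision and recall. The function name, the document identifiers, and the topic labels are hypothetical and are not taken from the paper; the paper's exact averaging scheme is not stated in the abstract.

```python
from typing import Dict, Set, Tuple


def micro_precision_recall(
    predicted: Dict[str, Set[str]],
    reference: Dict[str, Set[str]],
) -> Tuple[float, float]:
    """Micro-averaged precision and recall for multi-label topic assignment.

    Only documents present in `reference` are scored; a document with no
    prediction counts as having an empty set of assigned topics.
    """
    true_pos = 0      # topics that were both assigned and correct
    n_predicted = 0   # all topics assigned by the module
    n_reference = 0   # all topics in the reference annotation
    for doc_id, ref_topics in reference.items():
        pred_topics = predicted.get(doc_id, set())
        true_pos += len(pred_topics & ref_topics)
        n_predicted += len(pred_topics)
        n_reference += len(ref_topics)
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_reference if n_reference else 0.0
    return precision, recall


# Hypothetical toy data: two documents with reference and assigned topics.
reference = {"doc1": {"sport", "football"}, "doc2": {"politics"}}
predicted = {"doc1": {"sport", "tennis"}, "doc2": {"politics", "economy"}}

precision, recall = micro_precision_recall(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Micro-averaging pools the counts over all documents, so documents with many topics weigh more than documents with few. A per-document (macro) average would be an equally plausible choice.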
Similar resources
Application of Lemmatization and Summarization Methods in Topic Identification Module for Large Scale Language Modeling Data Filtering
The paper presents experiments with the topic identification module which is a part of a complex system for acquisition and storing large volumes of text data. The topic identification module processes each acquired data item and assigns it topics from a defined topic hierarchy. The topic hierarchy is quite extensive – it contains about 450 topics and topic categories. It can easily happen that...
Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation
Purpose: This study investigates the automatic keyword extraction from the table of contents of Persian e-books in the field of science using LDA topic modeling, evaluating their similarity with golden standard, and users' viewpoints of the model keywords. Methodology: This is a mixed text-mining research in which LDA topic modeling is used to extract keywords from the table of contents of sci...
In-Network Phased Filtering Mechanism for a Large-Scale RFID Inventory Application
RFID technology is one of the automatic identification technologies. In current RFID systems, RFID data are managed and processed by middleware. In the near future, when RFID technology is applied to large-scale warehouses, airports, or seaports, wireless sensors integrated with an RFID reader will need to form a wireless sensor network because of the difficulty of building a wired network...
Business Rule Based Extension of a Semantic Process Modeling Language for Managing Business Process Compliance in the Financial Sector
Managing business process compliance is an important topic in the financial sector. Various scandals and the financial crisis have caused many new constraints and legal regulations that banks and financial institutions have to face. Based on a domain-specific semantic business process modeling notation we propose generic process compliance business rules that serve as a first step towards the i...
Large Scale Distributed Acoustic Modeling With Back-Off N-Grams
The paper revives an older approach to acoustic modeling that borrows from n-gram language modeling in an attempt to scale up both the amount of training data and model size (as measured by the number of parameters in the model), to approximately 100 times larger than current sizes used in automatic speech recognition. In such a data-rich setting, we can expand the phonetic context significantl...